In [7]:
# Import all of the things you need to import!
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

Homework 14 (or so): TF-IDF text analysis and clustering

Hooray, we kind of figured out how text analysis works! Some of it is still magic, but at least the TF and IDF parts make a little sense. Kind of. Somewhat.

No, just kidding, we're professionals now.

Investigating the Congressional Record

The Congressional Record is more or less what happened in Congress every single day. Speeches and all that. A good large source of text data, maybe?

Let's pretend it's totally secret but we just got it leaked to us in a data dump, and we need to check it out. It was leaked from this page here.


In [2]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9607k  100 9607k    0     0  6136k      0  0:00:01  0:00:01 --:--:-- 6142k

In [3]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz

You can explore the files if you'd like, but we're going to get the ones from convote_v1.1/data_stage_one/development_set/. It's a bunch of text files.


In [4]:
# glob finds files matching a certain filename pattern
import glob

# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]


Out[4]:
['convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327025_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327044_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327046_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_1479036_DON.txt']
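
A note on the pattern above: it ends in a bare `*`, so it matches every file in the directory, not just text files, and `glob.glob` makes no promise about the order of its results. Here is a minimal sketch of a stricter variant, assuming you only want `.txt` files in a reproducible order. The temporary directory and file names are made up for the demonstration.

```python
import glob
import os
import tempfile

# Build a throwaway directory with two .txt files and one stray file
# (hypothetical names, only for this demonstration).
demo_dir = tempfile.mkdtemp()
for name in ["b_speech.txt", "a_speech.txt", "README"]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("placeholder")

# '*.txt' skips the stray README; sorted() makes the order reproducible.
txt_paths = sorted(glob.glob(os.path.join(demo_dir, "*.txt")))
txt_names = [os.path.basename(p) for p in txt_paths]
print(txt_names)
```

For this dataset the plain `*` works because the folder only holds speech files, so treat this as an optional tightening.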

In [5]:
len(paths)


Out[5]:
702

So great, we have 702 of them. Now let's import them.


In [8]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()


Out[8]:
content filename pathname
0 mr. chairman , i thank the gentlewoman for yie... 052_400011_0327014_DON.txt convote_v1.1/data_stage_one/development_set/05...
1 mr. chairman , i want to thank my good friend ... 052_400011_0327025_DON.txt convote_v1.1/data_stage_one/development_set/05...
2 mr. chairman , i rise to make two fundamental ... 052_400011_0327044_DON.txt convote_v1.1/data_stage_one/development_set/05...
3 mr. chairman , reclaiming my time , let me mak... 052_400011_0327046_DON.txt convote_v1.1/data_stage_one/development_set/05...
4 mr. chairman , i thank my distinguished collea... 052_400011_1479036_DON.txt convote_v1.1/data_stage_one/development_set/05...

In class we had the texts variable. For the homework you can just use speeches_df['content'] to get the same sort of list of texts.

Take a look at the contents of the first 5 speeches


In [12]:
# enumerate(..., start=1) numbers the speeches from 1 without a manual counter
for speech_num, speech in enumerate(speeches_df['content'].head(5), start=1):
    print(speech_num)
    print(speech)
    print('')


1
mr. chairman , i thank the gentlewoman for yielding me this time . 
my good colleague from california raised the exact and critical point . 
the question is , what happens during those 45 days ? 
we will need to support elections . 
there is not a single member of this house who has not supported some form of general election , a special election , to replace the members at some point . 
but during that 45 days , what happens ? 
the chair of the constitution subcommittee says this is what happens : martial law . 
we do not know who would fill the vacancy of the presidency , but we do know that the succession act most likely suggests it would be an unelected person . 
the sponsors of the bill before us today insist , and i think rightfully so , on the importance of elections . 
but to then say that during a 45-day period we would have none of the checks and balances so fundamental to our constitution , none of the separation of powers , and that the presidency would be filled by an unelected member of the cabinet who not a single member of this country , not a single citizen , voted to fill that position , and that that person would have no checks and balances from congress for a period of 45 days i find extraordinary . 
i find it inconsistent . 
i find it illogical , and , frankly , i find it dangerous . 
the gentleman from wisconsin refused earlier to yield time , but i was going to ask him , if virginia has those elections in a shorter time period , they should be commended for that . 
so now we have a situation in the congress where the virginia delegation has sent their members here , but many other states do not have members here . 
do they at that point elect a speaker of the house in the absence of other members ? 
and then three more states elect their representatives , temporary replacements , or full replacements at that point . 
they come in . 
do they elect a new speaker ? 
and if that happens , who becomes the president under the succession act ? 
this bill does not address that question . 
this bill responds to real threats with fantasies . 
it responds with the fantasy , first of all , that a lot of people will still survive ; but we have no guarantee of that . 
it responds with the fantasy that those who do survive will do the right thing . 
we are here having this debate , we have debates every day , because people differ on what the right thing is to do . 
i have been in very traumatic situations with people in severe car wrecks and mountain climbing accidents . 
my experience has not been that crisis imbues universal sagacity and fairness . 
it has not been that . 
people respond in extraordinary ways , and we must preserve an institution that has the deliberative body and the checks and balances to meet those challenges . 
many of our states are going increasingly to mail-in ballots . 
we in this body were effectively disabled by an anthrax attack not long after september 11 . 
i would ask my dear friends , will you conduct this election in 45 days if there is anthrax in the mail and still preserve the franchise of the american people ? 
how will you do that ? 
you have no answer to that question . 
i find it extraordinary , frankly , that while saying you do not want to amend the constitution , we began this very congress by amending the constitution through the rule , by undermining the principle that a quorum is 50 percent of the body and instead saying it is however many people survive . 
and if that rule applies , who will designate it , who will implement it ? 
the speaker , or the speaker 's designee ? 
again , not an elected person , as you say is so critical and i believe is critical , but a temporary appointee , frankly , who not a single other member of this body knows who they are . 
so we not only have an unelected person , we have an unknown person who will convene this body , and who , by the way , could conceivably convene it for their own election to then become the president of the united states under the succession act . 
you have refused steadfastly to debate this real issue broadly . 
you had a mock debate in the committee on the judiciary in which the distinguished chairman presented my bill without allowing me the courtesy or dignity to defend it myself . 
and on that , you proudly say you defend democracy . 
sir , i think you dissemble in that regard . 
here is the fundamental question for us , my friends , and it is this : the american people are watching television and an announcement comes on and says the congress has been destroyed in a nuclear attack , the president and vice president are killed and the supreme court is dead and thousands of our citizens in this town are . 
what happens next ? 
under your bill , 45 days of chaos . 
apparently , according to the committee on the judiciary subcommittee on the constitution chairman , 45 days of marshal law , rule of this country by an unelected president with no checks and balances . 
or an alternative , an alternative which says quite simply that the people have entrusted the representatives they send here to make profound decisions , war , taxation , a host of other things , and those representatives would have the power under the bill of the gentleman from california ( mr. rohrabacher ) xz4003430 bill or mine to designate temporary successors , temporary , only until we can have a real election . 
the american people , in one scenario , are told we do not know who is going to run the country , we have no representatives ; where in another you will have temporary representatives carrying your interests to this great body while we deliberate and have real elections . 
that is the choice . 
you are making the wrong choice today if you think you have solved this problem . 


2
mr. chairman , i want to thank my good friend from california ( mr. rohrabacher ) xz4003430 . 
i will always remember that day , as we all will . 
his point is well taken . 
i understand there is good intent behind the bill before us today and the amendment , but it is not enough . 
it simply is not . 
it leaves our country vulnerable for 45 days and that is too long . 
the distinguished chairman of the committee on the judiciary made some comments recently that suggested that somehow terrorists would oppose this bill and by some implication would favor the bill the gentleman from california ( mr. rohrabacher ) xz4003430 and i have put forward because it seems to support their autocratic views of government . 
nothing could be further from the truth . 
in fact , what our bill would do is tell the terrorists , you could come on a single day and set off a nuclear weapon in this town and kill every single member of us ; and though we would be missed , the very next day the congress would be up and functioning with every single state , every single district having full representation by statesmen and stateswomen at a time of national crisis . 
that is what the gentleman from california ( mr. rohrabacher ) xz4003430 and i are trying to do . 
we are trying to tell the terrorists , you can kill all of us as individuals , but you will not defeat this institution . 
you will not defeat the principle of representation . 
you will not defeat the principles of checks and balances . 
you will not impose martial law . 
here is the irony . 
if terrorists hit us today when we finally vote on this , let us suppose a few democrats do not make it over here . 
you are leaving this country vulnerable to change in power . 
if the terrorists were to strike your conference retreat where the president speaks to the republican house and senate members and kill hundreds of house and senate members on the republican side , the democrats at that point claim the majority . 
the democrats at that point elect a speaker of the house . 
i am a democrat , for goodness sakes ; but that is not the way to leave our country vulnerable . 
you are leaving your own party , you are leaving the will of the people through their elections vulnerable . 
if we have temporary replacements , you immediately reconstitute the house ; you immediately ensure representation ; you assure that you maintain the balance of political power ; and you do it in an orderly , structured way with no chaos , in a way that is constitutionally valid by definition . 
what you have proposed is not necessarily constitutionally valid . 
it leaves the terrorists able to change our system of government . 
it depends on a fantasy immediate or quick election . 
it does not allow really qualified people necessarily to get here and act in time . 
there are so many things you have left undone . 
you are going to try to say that at the start of this year we have solved this problem ; let us go home . 
you have not solved the problem , and it is a doggone disgrace , and it is a danger to this country . 
the other day a gentleman testified before the committee on the budget and said this : `` the lack of preparation for continuity , for true continuity invites attack. '' you are inviting attack . 
not preventing attack . 


3
mr. chairman , i rise to make two fundamental points before we proceed to vote on this . 
the two points are these : this resolution does not solve the real problem and it may create more problems than it purports to solve , and we have to understand that . 
it does not solve the problem for this reason : by leaving us without a congress for 45 days , we essentially impose the opportunity for the executive branch to exert marshal law , and that is not what the framers of this country had in mind . 
this bill , if we do not provide some mechanism for prompt replacement other than this bill , will leave this country governed by an unelected executive , a cabinet member most likely who not a single american elected to that office . 
furthermore , it has a host of problems . 
it does not address the possibility that one delegation will elect its representatives more promptly than another . 
they will come to this body , choose one of its members as speaker . 
that person could move on to become the president . 
then another delegation comes in , et cetera . 
you are essentially leaving this country without a house of representatives , without checks and balances , without separation of powers , for at least 45 days , assuming an election can be held in 45 days and assuming that the terrorists through an anthrax attack , like they subjected this very capitol to , will not somehow undermine that ability . 
this is reality . 
we have seen the reality here . 
we saw those airplanes hit the buildings , we saw the anthrax , and yet we are not truly acting to solve this . 
mr. chairman , i yield to my distinguished friend , the gentleman from california ( mr. rohrabacher ) xz4003430 . 


4
mr. chairman , reclaiming my time , let me make two final points : one , the majority party must understand this : if you are at a republican conference retreat and terrorists should strike you and kill the president and vice president and significant numbers of your side of the aisle , the democrats under your proposed law will obtain the majority , will elect a speaker of the house , and that person will then become the president of the united states of america . 
you are leaving this country vulnerable to that . 
you must not do it . 
you must not . 
this matter must be taken seriously . 
it deserves full debate . 
whether it is the proposal of the gentleman from california ( mr. rohrabacher ) xz4003430 and mine or others , we should commit to having this full house seriously consider this . 
if we do not and we are not fortunate , history will not look kindly upon the jeopardy in which we have left this great nation . 
vote no on this bill and insist on true debate on true continuity of congress in a responsible way that protects the balance of power , assures real succession to the presidency , and , most importantly , assures that your constituents will have representation at a time when our nation may well go to nuclear war , institute a draft , appropriate trillions of dollars , suspend habeas corpus and impose marshal law . 
you do not want that . 
but if you stop at this bill , you leave this nation vulnerable . 
mr. chairman , if there is no one to speak in opposition , i ask unanimous consent to withdraw my preferential motion . 


5
mr. chairman , i thank my distinguished colleague , and i appreciate his leadership on this issue . 
the gentleman from california ( mr. rohrabacher ) xz4003430 spoke eloquently about the need for the rohrabacher/baird amendment ; and i would like to address it briefly , if i may . 
madison is quoted on this topic , but let me quote madison from federalist 47 . 
he said : `` the accumulation of all powers , legislative , executive , and judiciary in the same hands , whether of one , a few , or many , and whether hereditary , self-appointed , or elected , may justly be pronounced the very definition of tyranny. '' now , i would like , if i may , to ask my colleagues , before we pass this appropriations bill with legislative language in it alleging to maintain continuity , to maybe address a couple of questions , before my colleagues vote on this , and i will yield time . 
not for a filibuster , but just to address some questions . 
how will we , given madison 's concern , maintain checks and balances during the 49-day period until we have the special elections ? 
i would be happy to yield 30 seconds to anyone who plans to vote for this bill to address that question . 


Doing our analysis

Use the sklearn package and a plain boring CountVectorizer to get a list of all of the tokens used in the speeches. If it won't list them all, that's ok! Make a dataframe with those terms as columns.

Be sure to use the built-in English-language stopword list, so that common words like "the" get dropped.
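
Before reaching for sklearn, it can help to see roughly what a count vectorizer does. The sketch below is a plain-Python approximation, not sklearn's implementation. The token pattern mimics CountVectorizer's documented default (runs of two or more word characters), and the stopword set is a tiny made-up stand-in for the real English list. The helper names `tokenize` and `count_matrix` are hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in stopword list (assumption; sklearn's 'english' list is much longer).
STOPWORDS = {"the", "is", "to", "of", "and", "this"}

def tokenize(text):
    # Lowercase, keep runs of 2+ word characters, then drop stopwords.
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOPWORDS]

def count_matrix(texts):
    counts = [Counter(tokenize(text)) for text in texts]
    # Vocabulary = every distinct token, sorted alphabetically (one column each).
    vocab = sorted(set().union(*counts))
    # One row per document, one count per vocabulary term.
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows

docs = ["mr. chairman , this bill is the wrong bill .", "i yield to the chairman ."]
vocab, rows = count_matrix(docs)
print(vocab)
print(rows)
```

Each row is one speech and each column is one term, which is the same shape as the dataframe this section asks for.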


In [13]:
count_vectorizer = CountVectorizer(stop_words = 'english')

In [17]:
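# Needs a fitted vectorizer (cell In [14] ran first); in scikit-learn 1.2+ use get_feature_names_out() instead
count_vectorizer.get_feature_names()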


Out[17]:
['000',
 '00007',
 '018',
 '050',
 '092',
 '10',
 '100',
 '106',
 '107',
 '108',
 '108th',
 '109th',
 '10th',
 '11',
 '110',
 '114',
 '117',
 '118',
 '11th',
 '12',
 '120',
 '121',
 '122',
 '123',
 '125',
 '128',
 '12898',
 '13',
 '13279',
 '1332',
 '1335',
 '1344',
 '135',
 '138',
 '14',
 '140',
 '143',
 '144',
 '145',
 '149',
 '1498',
 '14th',
 '15',
 '150',
 '1520',
 '153',
 '155',
 '159',
 '16',
 '160',
 '162',
 '163',
 '165',
 '1671',
 '1675',
 '17',
 '170',
 '1700',
 '174',
 '178',
 '1787',
 '17th',
 '18',
 '180',
 '1800',
 '1800s',
 '181',
 '1812',
 '1855',
 '186',
 '1868',
 '18th',
 '19',
 '190',
 '1907',
 '1922',
 '1927',
 '1930',
 '1940s',
 '1950s',
 '196',
 '1960',
 '1960s',
 '1964',
 '1965',
 '1967',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1976',
 '1979',
 '198',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '19th',
 '1st',
 '20',
 '200',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2011',
 '2016',
 '202',
 '2072',
 '20th',
 '21',
 '2123',
 '2132',
 '214',
 '216',
 '21st',
 '22',
 '220',
 '2210',
 '2217',
 '222',
 '223',
 '225',
 '226',
 '229',
 '23',
 '231',
 '2324',
 '234',
 '2361',
 '23rd',
 '24',
 '240',
 '241',
 '2411',
 '242',
 '2451',
 '248',
 '25',
 '250',
 '2586',
 '26',
 '261',
 '263',
 '2646',
 '26th',
 '27',
 '270',
 '273',
 '275',
 '278',
 '279',
 '28',
 '283',
 '2844',
 '286',
 '287',
 '2882',
 '2884',
 '2888',
 '29',
 '2904',
 '2926',
 '293',
 '2934',
 '2944',
 '297',
 '2975',
 '2985',
 '2d',
 '2nd',
 '30',
 '300',
 '3000',
 '3004',
 '3005',
 '3006',
 '301',
 '302',
 '303',
 '304',
 '305',
 '306',
 '3061',
 '309',
 '3090',
 '30s',
 '31',
 '310',
 '311',
 '3130',
 '3160',
 '3162',
 '317',
 '32',
 '3238',
 '327',
 '3283',
 '329',
 '33',
 '3306',
 '332',
 '336',
 '34',
 '340',
 '345',
 '35',
 '350',
 '352',
 '353',
 '36',
 '365',
 '37',
 '37th',
 '38',
 '383',
 '387',
 '388',
 '39',
 '397',
 '40',
 '400',
 '40th',
 '41',
 '413',
 '42',
 '420',
 '421',
 '427',
 '43',
 '435',
 '439',
 '44',
 '440',
 '442',
 '45',
 '450',
 '454',
 '455',
 '457',
 '4571',
 '461',
 '465',
 '469',
 '47',
 '479',
 '48',
 '482',
 '483',
 '487',
 '488',
 '49',
 '492',
 '4th',
 '50',
 '500',
 '501',
 '502',
 '5064',
 '508',
 '51',
 '5135',
 '52',
 '521',
 '525',
 '526',
 '53',
 '5304',
 '5305',
 '5306',
 '533',
 '53857',
 '539',
 '54',
 '543',
 '544',
 '55',
 '554',
 '562',
 '564',
 '57',
 '574',
 '58',
 '587',
 '589',
 '59',
 '5th',
 '60',
 '600',
 '604',
 '605',
 '6070',
 '609',
 '612',
 '62',
 '63',
 '6370',
 '639',
 '64',
 '641',
 '65',
 '650',
 '653',
 '66',
 '67',
 '670',
 '672',
 '675',
 '68',
 '69',
 '692',
 '698',
 '70',
 '700',
 '701',
 '702',
 '719',
 '72',
 '724',
 '74',
 '743',
 '75',
 '750',
 '751',
 '754',
 '778',
 '79',
 '80',
 '800',
 '82',
 '822',
 '83',
 '830',
 '831',
 '84',
 '8400',
 '841',
 '845',
 '8494',
 '85',
 '850',
 '865',
 '868',
 '87',
 '870',
 '90',
 '900',
 '91',
 '912',
 '924',
 '92nd',
 '93',
 '94',
 '9500',
 '96',
 '97',
 '970',
 '975',
 '97th',
 '98',
 '9849',
 '99',
 '994',
 '9th',
 '__',
 'aaron',
 'aba',
 'abandon',
 'abandoned',
 'abandoning',
 'abcs',
 'abet',
 'abhorrent',
 'abide',
 'abides',
 'abiding',
 'abilities',
 'ability',
 'able',
 'ably',
 'abolish',
 'abraham',
 'abridgement',
 'abroad',
 'abrogation',
 'absence',
 'absent',
 'absentee',
 'absolutely',
 'absolve',
 'absorb',
 'absurd',
 'abundance',
 'abundant',
 'abuse',
 'abused',
 'abuses',
 'abusing',
 'abusive',
 'abysmal',
 'academic',
 'academically',
 'academics',
 'academy',
 'accede',
 'accelerated',
 'accept',
 'acceptable',
 'acceptance',
 'accepted',
 'accepting',
 'accepts',
 'access',
 'accessible',
 'accessing',
 'accession',
 'accessioning',
 'accessories',
 'accident',
 'accidents',
 'acclaimed',
 'accommodate',
 'accommodated',
 'accommodating',
 'accompanies',
 'accompanying',
 'accomplish',
 'accomplished',
 'accomplishes',
 'accomplishment',
 'accordance',
 'according',
 'accordingly',
 'account',
 'accountability',
 'accountable',
 'accountant',
 'accounting',
 'accounts',
 'accumulated',
 'accumulation',
 'accurate',
 'accurately',
 'accusations',
 'accused',
 'accustom',
 'achieve',
 'achieved',
 'achievement',
 'achievements',
 'achieving',
 'acknowledge',
 'acknowledged',
 'acknowledges',
 'aclu',
 'acquainted',
 'acquire',
 'acquired',
 'acquisition',
 'acquisitions',
 'acre',
 'acres',
 'acronym',
 'act',
 'acted',
 'acting',
 'action',
 'actionable',
 'actions',
 'activate',
 'active',
 'actively',
 'activities',
 'activity',
 'actor',
 'actors',
 'acts',
 'actual',
 'actually',
 'ada',
 'adamantly',
 'adams',
 'adc',
 'add',
 'added',
 'addiction',
 'adding',
 'addition',
 'additional',
 'additionally',
 'additions',
 'address',
 'addressed',
 'addresses',
 'addressing',
 'adds',
 'adequate',
 'adequately',
 'adhere',
 'adherents',
 'adhering',
 'adjacent',
 'adjourn',
 'adjournment',
 'adjudicated',
 'adjust',
 'adjusted',
 'adjustment',
 'adjustments',
 'administer',
 'administered',
 'administering',
 'administration',
 'administrations',
 'administrative',
 'administrator',
 'administrators',
 'admirable',
 'admire',
 'admission',
 'admit',
 'admitted',
 'admittedly',
 'admitting',
 'adolescence',
 'adopt',
 'adopted',
 'adopting',
 'adoption',
 'adoptions',
 'ads',
 'adult',
 'adults',
 'advance',
 'advanced',
 'advancement',
 'advancements',
 'advances',
 'advancing',
 'advantage',
 'advantaged',
 'advantages',
 'adventure',
 'adversary',
 'adverse',
 'adversely',
 'advertised',
 'advice',
 'advise',
 'advised',
 'advisor',
 'advisories',
 'advisory',
 'advocacy',
 'advocate',
 'advocated',
 'advocates',
 'aesthetic',
 'affairs',
 'affect',
 'affected',
 'affecting',
 'affects',
 'affiliated',
 'affiliation',
 'affirm',
 'affirmative',
 'affirmatively',
 'affirmed',
 'affirms',
 'affluent',
 'afford',
 'affordable',
 'afforded',
 'affording',
 'affront',
 'afghanistan',
 'afl',
 'aforementioned',
 'afraid',
 'africa',
 'african',
 'afscme',
 'aftermarket',
 'aftermath',
 'afternoon',
 'age',
 'aged',
 'agencies',
 'agency',
 'agenda',
 'agendas',
 'agents',
 'ages',
 'aggressively',
 'aggrieved',
 'ago',
 'agony',
 'agree',
 'agreed',
 'agreeing',
 'agreement',
 'agreements',
 'agrees',
 'agricultural',
 'agriculture',
 'aha',
 'ahead',
 'ahs',
 'aid',
 'aide',
 'aided',
 'aiding',
 'aim',
 'aimed',
 'aims',
 'air',
 'airing',
 'airline',
 'airplanes',
 'aisle',
 'ak',
 'akin',
 'akron',
 'al',
 'alabama',
 'alan',
 'alarm',
 'alarming',
 'alaska',
 'alaskan',
 'albany',
 'alcee',
 'aldebron',
 'alerted',
 'alexander',
 'alexandria',
 'alfred',
 'alice',
 'aliens',
 'align',
 'aligned',
 'aligns',
 'alike',
 'alive',
 'allegations',
 'allege',
 'alleged',
 'allegedly',
 'allegiance',
 'alleging',
 'alleviate',
 'alliance',
 'allied',
 'allocate',
 'allocation',
 'allocations',
 'allotment',
 'allotted',
 'allow',
 'allowable',
 'allowed',
 'allowing',
 'allows',
 'alluded',
 'almonds',
 'alphabet',
 'altamonte',
 'alter',
 'altered',
 'alternate',
 'alternates',
 'alternative',
 'alternatives',
 'alto',
 'amaze',
 'amazing',
 'ambassador',
 'ambulances',
 'ameliorate',
 'amend',
 'amendable',
 'amended',
 'amending',
 'amendment',
 'amendments',
 'america',
 'american',
 'americans',
 'amos',
 'amounting',
 'amounts',
 'amp',
 'ample',
 'amt',
 'analysis',
 'analyst',
 'analyze',
 'anathema',
 'anderson',
 'andrea',
 'andrews',
 'anecdotes',
 'angela',
 'angeles',
 'angry',
 'angst',
 'anguish',
 'animal',
 'animals',
 'animated',
 'ann',
 'anna',
 'annie',
 'annihilation',
 'anniston',
 'announce',
 'announced',
 'announcement',
 'annual',
 'annually',
 'anonymous',
 'ansje',
 'answer',
 'answered',
 'answers',
 'antagonizing',
 'antelope',
 'anthony',
 'anthrax',
 'anti',
 'anticipate',
 'anticipated',
 'anticipates',
 'antidumping',
 'antietam',
 'antiforum',
 'antimiscegenation',
 'antipathy',
 'antiquated',
 'antonio',
 'anxiety',
 'anxious',
 'anybody',
 'anymore',
 'anyplace',
 'anytime',
 'aoc',
 'apa',
 'apart',
 'apathy',
 'apologize',
 'apostle',
 'apparel',
 'apparent',
 'apparently',
 'appeal',
 'appealed',
 'appeals',
 'appear',
 'appeared',
 'appears',
 'appendices',
 'applaud',
 'appliance',
 'applicability',
 'applicable',
 'applicants',
 'application',
 'applied',
 'applies',
 'apply',
 'applying',
 'appoint',
 'appointed',
 'appointee',
 'appointing',
 'appointment',
 'appointments',
 'appreciably',
 'appreciate',
 'appreciated',
 'appreciates',
 'appreciation',
 'appreciative',
 'approach',
 'approached',
 'approaches',
 'appropriate',
 'appropriated',
 'appropriately',
 'appropriates',
 'appropriation',
 'appropriations',
 'appropriators',
 'approval',
 'approve',
 'approved',
 'approving',
 'approximately',
 'april',
 'aptly',
 'aquatic',
 'ar',
 'arab',
 'arabia',
 'arbitrary',
 'arc',
 'architect',
 'architects',
 'architectural',
 'architecture',
 'ardently',
 'ardmore',
 'area',
 'areas',
 'arena',
 'argentina',
 'argue',
 'argued',
 'arguing',
 'argument',
 'argumentative',
 'arguments',
 'arid',
 'arise',
 'arisen',
 'arising',
 'aristocracies',
 'aristocracy',
 'arizona',
 'arkansas',
 'arm',
 'armed',
 'armies',
 'armor',
 'arms',
 'army',
 'arnold',
 'arnolds',
 'arrange',
 'arrangements',
 'array',
 'arrays',
 'arrested',
 'arrests',
 'arrival',
 'arrived',
 'arrogance',
 'arrogant',
 'arsenal',
 'art',
 'article',
 'articles',
 'articulate',
 'artificial',
 'artificially',
 'arts',
 'ascertain',
 'asfe',
 'asian',
 'aside',
 'asides',
 'ask',
 'asked',
 'asking',
 'asks',
 'aspect',
 'aspects',
 'asphyxiating',
 'assault',
 'assemble',
 'assembly',
 'assert',
 'asserted',
 'assertion',
 'assertions',
 'assess',
 'assessed',
 'assessing',
 'assessment',
 'assessments',
 'assets',
 'assigned',
 'assignment',
 'assigns',
 'assimilating',
 'assist',
 'assistance',
 'assistant',
 'assisted',
 'assisting',
 'assists',
 'assoc',
 'associate',
 'associated',
 'associates',
 'association',
 'associations',
 'assume',
 'assumed',
 'assumes',
 'assuming',
 'assumption',
 'assumptions',
 'assurance',
 'assurances',
 'assure',
 'assured',
 'assures',
 'assuring',
 'asthma',
 'astounding',
 'astronaut',
 'astronomical',
 'athletic',
 'atkins',
 'atla',
 'atlanta',
 'atm',
 'atmosphere',
 'attach',
 'attached',
 'attaching',
 'attack',
 'attacked',
 'attacks',
 'attain',
 'attainable',
 'attaining',
 'attempt',
 'attempted',
 'attempting',
 'attempts',
 'attend',
 'attended',
 'attending',
 'attention',
 'attest',
 'attitude',
 'attorney',
 'attorneys',
 'attract',
 'audit',
 'audited',
 'auditing',
 'audits',
 'august',
 'austin',
 'australia',
 'authentic',
 'author',
 'authoring',
 'authorities',
 'authority',
 'authorization',
 'authorizations',
 'authorize',
 'authorized',
 'authorizes',
 'authorizing',
 'authors',
 'auto',
 'autocratic',
 'automatic',
 'automatically',
 'automobile',
 'automotive',
 'autonomy',
 'avail',
 'availability',
 'available',
 'avalanche',
 'avenue',
 'average',
 'aviation',
 'avoid',
 ...]

In [14]:
X = count_vectorizer.fit_transform(speeches_df['content'])

In [18]:
pd.DataFrame(X.toarray(), columns = count_vectorizer.get_feature_names())


Out[18]:
000 00007 018 050 092 10 100 106 107 108 ... youngsters youth yuan zero zeroing zeros zigler zirkin zoe zoellick
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 3 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
22 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
25 2 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
28 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
29 2 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
672 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
673 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
674 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
675 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
676 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
677 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
678 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
679 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
680 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
681 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
682 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
683 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
684 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
685 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
686 1 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
687 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
688 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
689 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
690 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
691 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
692 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
693 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
694 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
695 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
696 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
697 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
698 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
699 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
700 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
701 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

702 rows × 9106 columns

Okay, at 9,106 columns it's far too big to even look at. Let's build a new CountVectorizer that only keeps the top 100 words and get its list of features.
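
The `max_features` option is documented as keeping only the terms with the highest total count across the whole corpus. Here is a rough plain-Python sketch of that selection, assuming whitespace tokenization and a hypothetical helper named `top_terms`. It does not try to reproduce how sklearn breaks ties between equally frequent terms.

```python
from collections import Counter

def top_terms(texts, n):
    # Sum each term's count over every document in the corpus.
    totals = Counter()
    for text in texts:
        totals.update(text.lower().split())
    # Keep the n most frequent terms, then sort them alphabetically for column order.
    return sorted(term for term, _ in totals.most_common(n))

docs = ["vote vote vote bill", "bill amendment vote", "bill yield"]
print(top_terms(docs, 2))
```

Note that the ranking uses corpus-wide totals, so a word repeated many times in a single long speech can still make the cut.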

Now let's push all of that into a dataframe with nicely named columns.
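
One thing to watch for when combining `stop_words='english'` with a stemming tokenizer: sklearn filters stop words after the tokenizer has run, so a stemmed token such as `wa` (from "was") no longer matches the list and slips through. Here is a sketch of one possible workaround, assuming you filter inside the tokenizer before stemming. The stopword set and the `crude_stem` stand-in are invented for the demonstration; with NLTK installed you would pass `PorterStemmer().stem` instead.

```python
import re

# Tiny illustrative stopword set (assumption; not sklearn's full list).
STOPWORDS = {"was", "the", "a", "of"}

def crude_stem(word):
    # Stand-in stemmer for the demo: just strips a trailing 's'.
    return word[:-1] if word.endswith("s") else word

def filter_then_stem(text, stem=crude_stem, stopwords=STOPWORDS):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", text).lower().split()
    # Drop stop words while they are still whole words, then stem the rest.
    return [stem(w) for w in words if w not in stopwords]

def stem_then_filter(text, stem=crude_stem, stopwords=STOPWORDS):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", text).lower().split()
    # Mirrors the problem: stemming first turns 'was' into 'wa', which is not in the list.
    return [s for s in (stem(w) for w in words) if s not in stopwords]

text = "the bill was a danger"
print(filter_then_stem(text))
print(stem_then_filter(text))
```

If you go this route, you would hand the filtering tokenizer to CountVectorizer and leave out the `stop_words` argument, since the tokenizer already did that job.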


In [21]:
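from nltk.stem.porter import PorterStemmer
import re

porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    # Replace anything that isn't a letter, digit, or hyphen with a space, then lowercase and split
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    # Reduce each word to its Porter stem (e.g. "very" becomes "veri")
    words = [porter_stemmer.stem(word) for word in words]
    return words

# Keep only the 100 most frequent terms across all of the speeches
count_vectorizer = CountVectorizer(stop_words = 'english', tokenizer = stemming_tokenizer, max_features = 100)
X = count_vectorizer.fit_transform(speeches_df['content'])
# One row per speech, one column per stemmed term
pd.DataFrame(X.toarray(), columns = count_vectorizer.get_feature_names())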


Out[21]:
1 2 act allow amend american amp ani appropri associ ... urg use veri vote wa want way work year yield
0 0 0 3 1 2 3 0 0 0 0 ... 0 0 2 1 1 1 2 0 0 2
1 0 0 1 1 1 0 0 0 0 0 ... 0 0 1 1 0 1 3 0 1 0
2 0 0 1 0 0 1 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 1 0 1 1 0 0 0
4 0 0 0 0 1 0 0 0 1 0 ... 0 0 1 2 0 0 0 0 0 2
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 2 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 1 0 3 0 0 1 0 ... 0 0 0 2 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 1 1 5 2 0 0 0 0 ... 0 0 0 1 12 2 1 0 0 1
13 0 0 8 0 1 0 0 0 11 0 ... 0 2 0 2 0 0 1 0 5 1
14 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 1 0 ... 1 1 0 0 1 0 0 3 2 0
16 0 0 0 0 2 0 0 1 1 0 ... 0 0 0 0 0 1 0 1 2 0
17 0 0 0 2 4 0 0 0 0 0 ... 0 0 0 1 3 4 2 3 3 1
18 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3
20 1 0 1 1 0 3 0 1 0 0 ... 1 0 2 0 5 0 0 0 1 0
21 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 1 0 1
22 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 1 0 0 1
23 0 0 0 1 4 0 0 0 0 0 ... 1 1 0 0 1 1 0 0 0 1
24 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
25 0 1 0 0 1 1 0 1 0 0 ... 0 1 0 5 1 0 0 0 1 2
26 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 2
27 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
28 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
29 0 0 0 0 1 0 0 0 0 0 ... 0 1 2 0 0 0 0 3 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
672 0 0 1 0 7 0 0 4 2 1 ... 1 0 1 0 0 0 0 0 0 1
673 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
674 0 1 0 0 1 0 0 2 1 0 ... 0 0 0 0 0 2 0 0 0 1
675 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
676 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
677 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
678 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
679 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 2 0 0 0 2
680 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
681 0 0 1 1 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
682 0 1 3 0 0 5 0 5 0 7 ... 0 0 0 1 5 1 0 0 3 1
683 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
684 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
685 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
686 0 0 2 2 0 0 0 1 0 0 ... 1 0 0 0 3 0 0 0 1 1
687 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
688 0 0 0 0 6 0 0 0 0 0 ... 1 0 0 0 1 0 0 0 0 2
689 1 0 2 2 4 0 0 0 1 3 ... 1 0 1 1 1 0 0 0 0 1
690 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 2 0 0 0 2
691 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
692 0 1 0 2 7 1 0 3 0 1 ... 1 3 1 1 4 0 0 0 1 2
693 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
694 0 0 2 0 0 1 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
695 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
696 0 0 1 0 0 2 0 1 0 0 ... 0 1 0 1 0 1 0 0 0 0
697 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
698 0 0 0 0 2 4 0 0 0 0 ... 0 0 0 2 2 0 3 0 0 0
699 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
700 0 0 2 0 0 0 0 2 0 0 ... 1 0 0 4 0 2 2 2 0 0
701 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0

702 rows × 100 columns
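
To make the effect of `max_features` concrete, here is a rough plain-Python sketch of the idea: count every token across the whole corpus and keep only the N most frequent. The toy documents and the `top_n_vocabulary` helper are made up for illustration. This is not scikit-learn's actual implementation, and it skips stop words and stemming entirely.

```python
from collections import Counter

def top_n_vocabulary(documents, n):
    """Return the n most frequent whitespace-separated tokens, sorted alphabetically."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.lower().split())
    most_common = [term for term, _ in totals.most_common(n)]
    return sorted(most_common)

# Invented mini-corpus (not from the Congressional Record data)
toy_docs = [
    "chairman yield time chairman",
    "yield time amendment yield",
    "chairman amendment vote chairman",
]
vocab = top_n_vocabulary(toy_docs, 2)
print(vocab)  # ['chairman', 'yield']
```

Every other word is simply dropped from the matrix, which is why the table above has 100 columns instead of 9106.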

Everyone seems to start their speeches with "mr chairman" - how many speeches are there total, how many don't mention "chairman", and how many mention neither "mr" nor "chairman"?


In [23]:
# how many speeches are there total
len(speeches_df['content'])


Out[23]:
702

In [26]:
# how many speeches don't mention "chairman"
len(speeches_df[speeches_df['content'].str.contains('chairman') == False])


Out[26]:
249

In [43]:
# how many speeches mention neither "chairman" nor "mr."
len(speeches_df[(speeches_df['content'].str.contains('chairman') == False) & (speeches_df['content'].str.contains('mr.') == False)])


Out[43]:
75
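
One caveat about that last count: `str.contains` treats its pattern as a regular expression by default, so the `.` in `'mr.'` matches any character rather than a literal period. The sketch below shows the difference using Python's `re` module on made-up sentences. With pandas, passing `regex=False` (or escaping the period) should give the literal behavior. The 75 above was computed with the regex interpretation, so a literal match may give a different number.

```python
import re

# Made-up sentences for illustration, not taken from the dataset
samples = [
    "mr. chairman , i yield back",
    "the primrose path is long",
    "i support the amendment",
]

# Regex match: '.' is a wildcard, so 'mr.' also matches the 'mro' in 'primrose'
regex_hits = [bool(re.search('mr.', text)) for text in samples]

# Literal match: only a real 'mr.' counts
literal_hits = ['mr.' in text for text in samples]

print(regex_hits)    # [True, True, False]
print(literal_hits)  # [True, False, False]

# The pandas equivalent of the literal match would be:
# speeches_df['content'].str.contains('mr.', regex=False)
```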

What is the index of the speech that is the most thankful, a.k.a. includes the word 'thank' the most times?


In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [46]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, max_features = 100, use_idf=False, norm='l1')
X = tfidf_vectorizer.fit_transform(speeches_df['content'])
term_freq = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())

In [52]:
term_freq['thank'].sort_values(ascending = False).head(1)


Out[52]:
179    0.25
Name: thank, dtype: float64
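
With `use_idf=False` and `norm='l1'`, each row is, as far as I understand it, just the speech's counts divided by that row's total. So 0.25 means 'thank' makes up a quarter of the top-100-vocabulary tokens in speech 179. A minimal sketch of that arithmetic, using invented counts rather than real data (`l1_term_frequency` is a hypothetical helper):

```python
def l1_term_frequency(counts):
    """Divide each count by the row total so the values sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {term: 0.0 for term in counts}
    return {term: count / total for term, count in counts.items()}

# Hypothetical counts for two speeches (not real data)
short_speech = {'thank': 1, 'yield': 2, 'time': 1}
long_speech = {'thank': 5, 'yield': 20, 'time': 25}

print(l1_term_frequency(short_speech)['thank'])  # 0.25
print(l1_term_frequency(long_speech)['thank'])   # 0.1
```

The short speech ranks higher even though it says 'thank' fewer times, so this ranking finds the most thankful speech in proportion, not in raw count. Sorting the 'thank' column of the plain CountVectorizer dataframe would rank by raw count instead.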

If I'm searching for China and trade, what are the top 3 speeches to read according to the CountVectorizer?


In [58]:
(term_freq['china'] + term_freq['trade']).sort_values(ascending = False).head(3)


Out[58]:
345    0.397059
336    0.281250
402    0.250000
dtype: float64

Now what if I'm using a TfidfVectorizer?


In [53]:
l2_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, max_features = 100)
X = l2_vectorizer.fit_transform(speeches_df['content'])
l2_df = pd.DataFrame(X.toarray(), columns=l2_vectorizer.get_feature_names())
l2_df


Out[53]:
1 2 act allow amend american amp ani appropri associ ... urg use veri vote wa want way work year yield
0 0.000000 0.000000 0.096449 0.031370 0.045459 0.102570 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.059449 0.028406 0.027954 0.030172 0.064842 0.000000 0.000000 0.037749
1 0.000000 0.000000 0.072345 0.070591 0.051147 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.066888 0.063921 0.000000 0.067896 0.218866 0.000000 0.066236 0.000000
2 0.000000 0.000000 0.087605 0.000000 0.000000 0.093165 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.080997 0.077405 0.000000 0.000000 0.000000 0.000000 0.000000 0.051431
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.135191 0.000000 ... 0.000000 0.000000 0.000000 0.104059 0.000000 0.110529 0.118766 0.000000 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000 0.101489 0.000000 0.0 0.000000 0.164781 0.000000 ... 0.000000 0.000000 0.132722 0.253671 0.000000 0.000000 0.000000 0.000000 0.000000 0.168549
5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.693326
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.410171 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
7 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.693326
8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.000000 0.000000 0.000000 0.145293 0.000000 0.475061 0.0 0.000000 0.170927 0.000000 ... 0.000000 0.000000 0.000000 0.263132 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
10 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
11 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
12 0.000000 0.000000 0.034695 0.033854 0.122647 0.073795 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.030656 0.362016 0.065123 0.034988 0.000000 0.000000 0.020369
13 0.000000 0.000000 0.235060 0.000000 0.020773 0.000000 0.0 0.000000 0.371011 0.000000 ... 0.000000 0.063282 0.000000 0.051923 0.000000 0.000000 0.029630 0.000000 0.134507 0.017250
14 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.788324 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
15 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.093330 0.000000 ... 0.084901 0.087554 0.000000 0.000000 0.070696 0.000000 0.000000 0.228915 0.148880 0.000000
16 0.000000 0.000000 0.000000 0.000000 0.122494 0.000000 0.0 0.089923 0.099443 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.081302 0.000000 0.081302 0.158630 0.000000
17 0.000000 0.000000 0.000000 0.074535 0.108010 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.033746 0.099629 0.143378 0.077031 0.107534 0.104905 0.022422
18 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
19 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.369519
20 0.037501 0.000000 0.033592 0.032778 0.000000 0.107173 0.0 0.034869 0.000000 0.000000 ... 0.035078 0.000000 0.062117 0.000000 0.146045 0.000000 0.000000 0.000000 0.030756 0.000000
21 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.232980 0.000000 0.232980 0.000000 0.145740
22 0.000000 0.000000 0.000000 0.000000 0.101942 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.125375 0.000000 0.145407 0.000000 0.000000 0.084650
23 0.000000 0.000000 0.000000 0.156167 0.452612 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.167125 0.172349 0.000000 0.000000 0.139163 0.150205 0.000000 0.000000 0.000000 0.093960
24 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25 0.000000 0.085938 0.000000 0.000000 0.057484 0.086468 0.0 0.084399 0.000000 0.000000 ... 0.000000 0.087557 0.000000 0.359203 0.070698 0.000000 0.000000 0.000000 0.074442 0.095468
26 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.289492 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.325513
27 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.714499 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
28 0.000000 0.000000 0.000000 0.000000 0.494208 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
29 0.000000 0.000000 0.000000 0.000000 0.041446 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.063128 0.108402 0.000000 0.000000 0.000000 0.000000 0.165052 0.053673 0.034416
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
672 0.000000 0.000000 0.029296 0.000000 0.144984 0.000000 0.0 0.121638 0.067257 0.044994 ... 0.030591 0.000000 0.027086 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.017199
673 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.446466
674 0.000000 0.080445 0.000000 0.000000 0.053811 0.000000 0.0 0.158010 0.087369 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.142862 0.000000 0.000000 0.000000 0.044683
675 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.539227
676 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
677 0.000000 0.000000 0.000000 0.000000 0.246659 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
678 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.484423 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
679 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.075092 0.000000 0.000000 0.000000 0.000000 0.134979 0.000000 0.000000 0.000000 0.084436
680 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.788324 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
681 0.000000 0.000000 0.106532 0.103949 0.000000 0.113294 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
682 0.000000 0.039259 0.111432 0.000000 0.000000 0.197507 0.0 0.192781 0.000000 0.399338 ... 0.000000 0.000000 0.000000 0.032819 0.161486 0.034860 0.000000 0.000000 0.102023 0.021806
683 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.322722
684 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.275157
685 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
686 0.000000 0.000000 0.066143 0.064539 0.000000 0.000000 0.0 0.034329 0.000000 0.000000 ... 0.034534 0.000000 0.000000 0.000000 0.086268 0.000000 0.000000 0.000000 0.030279 0.019415
687 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.416747
688 0.000000 0.000000 0.000000 0.000000 0.337679 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.083124 0.000000 0.000000 0.000000 0.069217 0.000000 0.000000 0.000000 0.000000 0.093468
689 0.042695 0.000000 0.076490 0.074636 0.108157 0.000000 0.0 0.000000 0.043902 0.176219 ... 0.039936 0.000000 0.035360 0.033792 0.033255 0.000000 0.000000 0.000000 0.000000 0.022453
690 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.373497 0.000000 0.000000 0.000000 0.233640
691 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
692 0.000000 0.031968 0.000000 0.059024 0.149684 0.032165 0.0 0.094186 0.000000 0.046453 ... 0.031583 0.097710 0.027964 0.026724 0.105195 0.000000 0.000000 0.000000 0.027692 0.035513
693 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
694 0.000000 0.000000 0.160054 0.000000 0.000000 0.085106 0.0 0.000000 0.000000 0.000000 ... 0.083566 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.046982
695 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.714499 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
696 0.000000 0.000000 0.103849 0.000000 0.000000 0.220880 0.0 0.107797 0.000000 0.000000 ... 0.000000 0.111831 0.000000 0.091757 0.000000 0.097463 0.000000 0.000000 0.000000 0.000000
697 0.000000 0.000000 0.061674 0.060178 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.062194 0.000000 0.000000 0.000000
698 0.000000 0.000000 0.000000 0.000000 0.077355 0.232715 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.096674 0.095136 0.000000 0.165506 0.000000 0.000000 0.000000
699 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
700 0.000000 0.000000 0.088101 0.000000 0.000000 0.000000 0.0 0.091450 0.000000 0.000000 ... 0.045998 0.000000 0.000000 0.155685 0.000000 0.082683 0.088844 0.082683 0.000000 0.000000
701 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.714499 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

702 rows × 100 columns


In [59]:
(l2_df['china'] + l2_df['trade']).sort_values(ascending = False).head(3)


Out[59]:
345    1.296847
402    1.207024
317    1.202438
dtype: float64
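
The scores here can exceed 1 because `TfidfVectorizer` defaults to `norm='l2'`: each row has unit Euclidean length, so a single entry is at most 1 but the sum of two entries can reach about 1.41. The sketch below reproduces the default weighting as I understand it from the scikit-learn documentation (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by l2 normalization) on invented counts. `tfidf_rows` is a hypothetical helper, not part of any library.

```python
import math

def tfidf_rows(count_rows):
    """Weight raw counts by smoothed idf, then scale each row to unit l2 length."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term appears
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = [[c * w for c, w in zip(row, idf)] for row in count_rows]
    result = []
    for row in weighted:
        length = math.sqrt(sum(v * v for v in row))
        result.append([v / length if length else 0.0 for v in row])
    return result

# Columns: china, trade, chairman (invented counts, not real data)
counts = [
    [3, 3, 1],
    [0, 0, 4],
    [0, 1, 2],
]
rows = tfidf_rows(counts)
china_plus_trade = rows[0][0] + rows[0][1]
print(round(china_plus_trade, 3))
```

In this toy example 'china' appears in only one document, so it gets a larger idf weight than 'chairman', which appears in all three. That is the same effect that pushes different speeches into the top 3 here than in the term-frequency ranking.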

What's the content of the speeches? Here's a way to get them:


In [60]:
# index 0 is the first speech, which was the first one imported.
paths[0]


Out[60]:
'convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt'

In [61]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
!cat {paths[0]}


mr. chairman , i thank the gentlewoman for yielding me this time . 
my good colleague from california raised the exact and critical point . 
the question is , what happens during those 45 days ? 
we will need to support elections . 
there is not a single member of this house who has not supported some form of general election , a special election , to replace the members at some point . 
but during that 45 days , what happens ? 
the chair of the constitution subcommittee says this is what happens : martial law . 
we do not know who would fill the vacancy of the presidency , but we do know that the succession act most likely suggests it would be an unelected person . 
the sponsors of the bill before us today insist , and i think rightfully so , on the importance of elections . 
but to then say that during a 45-day period we would have none of the checks and balances so fundamental to our constitution , none of the separation of powers , and that the presidency would be filled by an unelected member of the cabinet who not a single member of this country , not a single citizen , voted to fill that position , and that that person would have no checks and balances from congress for a period of 45 days i find extraordinary . 
i find it inconsistent . 
i find it illogical , and , frankly , i find it dangerous . 
the gentleman from wisconsin refused earlier to yield time , but i was going to ask him , if virginia has those elections in a shorter time period , they should be commended for that . 
so now we have a situation in the congress where the virginia delegation has sent their members here , but many other states do not have members here . 
do they at that point elect a speaker of the house in the absence of other members ? 
and then three more states elect their representatives , temporary replacements , or full replacements at that point . 
they come in . 
do they elect a new speaker ? 
and if that happens , who becomes the president under the succession act ? 
this bill does not address that question . 
this bill responds to real threats with fantasies . 
it responds with the fantasy , first of all , that a lot of people will still survive ; but we have no guarantee of that . 
it responds with the fantasy that those who do survive will do the right thing . 
we are here having this debate , we have debates every day , because people differ on what the right thing is to do . 
i have been in very traumatic situations with people in severe car wrecks and mountain climbing accidents . 
my experience has not been that crisis imbues universal sagacity and fairness . 
it has not been that . 
people respond in extraordinary ways , and we must preserve an institution that has the deliberative body and the checks and balances to meet those challenges . 
many of our states are going increasingly to mail-in ballots . 
we in this body were effectively disabled by an anthrax attack not long after september 11 . 
i would ask my dear friends , will you conduct this election in 45 days if there is anthrax in the mail and still preserve the franchise of the american people ? 
how will you do that ? 
you have no answer to that question . 
i find it extraordinary , frankly , that while saying you do not want to amend the constitution , we began this very congress by amending the constitution through the rule , by undermining the principle that a quorum is 50 percent of the body and instead saying it is however many people survive . 
and if that rule applies , who will designate it , who will implement it ? 
the speaker , or the speaker 's designee ? 
again , not an elected person , as you say is so critical and i believe is critical , but a temporary appointee , frankly , who not a single other member of this body knows who they are . 
so we not only have an unelected person , we have an unknown person who will convene this body , and who , by the way , could conceivably convene it for their own election to then become the president of the united states under the succession act . 
you have refused steadfastly to debate this real issue broadly . 
you had a mock debate in the committee on the judiciary in which the distinguished chairman presented my bill without allowing me the courtesy or dignity to defend it myself . 
and on that , you proudly say you defend democracy . 
sir , i think you dissemble in that regard . 
here is the fundamental question for us , my friends , and it is this : the american people are watching television and an announcement comes on and says the congress has been destroyed in a nuclear attack , the president and vice president are killed and the supreme court is dead and thousands of our citizens in this town are . 
what happens next ? 
under your bill , 45 days of chaos . 
apparently , according to the committee on the judiciary subcommittee on the constitution chairman , 45 days of marshal law , rule of this country by an unelected president with no checks and balances . 
or an alternative , an alternative which says quite simply that the people have entrusted the representatives they send here to make profound decisions , war , taxation , a host of other things , and those representatives would have the power under the bill of the gentleman from california ( mr. rohrabacher ) xz4003430 bill or mine to designate temporary successors , temporary , only until we can have a real election . 
the american people , in one scenario , are told we do not know who is going to run the country , we have no representatives ; where in another you will have temporary representatives carrying your interests to this great body while we deliberate and have real elections . 
that is the choice . 
you are making the wrong choice today if you think you have solved this problem . 
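
If you would rather stay in Python than shell out to `cat`, the same thing can be done by opening the file directly. A small sketch: `read_speech` is a hypothetical helper, and it assumes `paths` is in the same order as the rows of the dataframe (as the comment in the cell above says), so an index from one of the rankings can be used to look up its file.

```python
def read_speech(path, max_lines=None):
    """Return the text of one speech file, optionally only its first max_lines lines."""
    with open(path) as speech_file:
        lines = speech_file.read().splitlines()
    if max_lines is not None:
        lines = lines[:max_lines]
    return '\n'.join(lines)

# Example usage, assuming `paths` from the glob call earlier in the notebook:
# print(read_speech(paths[345], max_lines=5))
```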

Now search for something else! Pick another two terms that might show up, like elections and chaos, or whatever you think might be interesting.


In [65]:
(l2_df['elect'] + l2_df['children']).sort_values(ascending = False).head(3)


Out[65]:
124    0.797181
25     0.760251
142    0.723351
dtype: float64

Enough of this garbage, let's cluster

Using a simple counting vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.

Using a term frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.

Using a term frequency inverse document frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.


In [84]:
# Simple counting vectorizer
vectorizer = CountVectorizer(tokenizer=stemming_tokenizer, stop_words='english', max_features = 10000)
X = vectorizer.fit_transform(speeches_df['content'])

from sklearn.cluster import KMeans

number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]  # top 5 terms for this cluster
    print("Cluster {}: {}".format(i, ' '.join(top_words)))


Top terms per cluster:
Cluster 0: thi mr chairman amend gentleman
Cluster 1: head start program right religi
Cluster 2: amp nbsp p gt lt
Cluster 3: thi mr state s time
Cluster 4: associ nation restaur contractor chamber
Cluster 5: start head program children thi
Cluster 6: church wa s embezzl financi
Cluster 7: rule 11 state feder court
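
The line `km.cluster_centers_.argsort()[:, ::-1]` is doing the heavy lifting in that cell: for each cluster centroid it lists the column indices from the largest weight to the smallest, and those indices are then looked up in `terms`. Spelled out in plain Python on invented numbers (the `top_terms` helper is hypothetical):

```python
def top_terms(centroids, terms, n):
    """For each centroid, return the n terms with the largest weights, biggest first."""
    result = []
    for centroid in centroids:
        # Indices of the centroid's entries, ordered from largest to smallest weight
        order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
        result.append([terms[j] for j in order[:n]])
    return result

# Invented centroids for a three-term vocabulary
terms = ['amend', 'china', 'trade']
centroids = [
    [0.10, 0.70, 0.20],
    [0.50, 0.00, 0.40],
]
for i, words in enumerate(top_terms(centroids, terms, 2)):
    print("Cluster {}: {}".format(i, ' '.join(words)))
```

KMeans starts from random centroids, so the cluster numbers and even the groupings can change between runs. Passing a fixed `random_state` to `KMeans` should make a run repeatable.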

In [85]:
# Term frequency vectorizer
vectorizer = TfidfVectorizer(use_idf=False, tokenizer=stemming_tokenizer, stop_words='english', max_features = 10000, norm = 'l1')
X = vectorizer.fit_transform(speeches_df['content'])

from sklearn.cluster import KMeans

number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]  # top 5 terms for this cluster
    print("Cluster {}: {}".format(i, ' '.join(top_words)))


Top terms per cluster:
Cluster 0: speaker mr time balanc reserv
Cluster 1: thi mr chairman gentleman amend
Cluster 2: mr yield chairman gentleman minut
Cluster 3: chairman mr time balanc yield
Cluster 4: yield mr gentleman chairman speaker
Cluster 5: mr demand vote record chairman
Cluster 6: yield gentleman texa wisconsin illinoi
Cluster 7: amend chairman mr opposit time

In [86]:
# Term frequency inverse document frequency vectorizer
vectorizer = TfidfVectorizer(tokenizer=stemming_tokenizer, stop_words='english', max_features = 10000)
X = vectorizer.fit_transform(speeches_df['content'])

from sklearn.cluster import KMeans

number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]  # top 5 terms for this cluster
    print("Cluster {}: {}".format(i, ' '.join(top_words)))


Top terms per cluster:
Cluster 0: start head program children thi
Cluster 1: balanc time chairman mr reserv
Cluster 2: china trade thi s enforc
Cluster 3: demand record vote mr chairman
Cluster 4: consent claim opposit unanim ask
Cluster 5: gentleman yield mr chairman texa
Cluster 6: thi mr amend chairman time
Cluster 7: mr minut yield chairman gentleman

Which one do you think works the best?

One of the second two, but it's hard to tell which.

Harry Potter time

I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.

I want you to read them in, vectorize them and cluster them. Use this process to find out the two types of Harry Potter fanfiction. What is your hypothesis?


In [78]:
paths = glob.glob('hp/*')
paths[:5]


Out[78]:
['hp/10001898.txt',
 'hp/10004131.txt',
 'hp/10004927.txt',
 'hp/10007980.txt',
 'hp/10010343.txt']

In [79]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
hp_df = pd.DataFrame(speeches)
hp_df.head()


Out[79]:
content filename pathname
0 Prologue: The MissionDisclaimer: All character... 10001898.txt hp/10001898.txt
1 BlackDisclaimer: I do not own Harry PotterAuth... 10004131.txt hp/10004131.txt
2 Chapter 1"I'm pregnant.""""Mum please say some... 10004927.txt hp/10004927.txt
3 Author's Note: Hey, just so you know, this is ... 10007980.txt hp/10007980.txt
4 Disclaimer: I do not own Harry Potter and frie... 10010343.txt hp/10010343.txt

In [82]:
# # The two clusters for this are 
# # Top terms per cluster:
# # Cluster 0: wa hi harri hermion t
# # Cluster 1: wa hi t s lili
# # which is unintelligible to me

# vectorizer = TfidfVectorizer(tokenizer=stemming_tokenizer, stop_words='english', max_features = 10000)
# X = vectorizer.fit_transform(hp_df['content'])

# from sklearn.cluster import KMeans

# number_of_clusters = 2
# km = KMeans(n_clusters=number_of_clusters)
# km.fit(X)

# print("Top terms per cluster:")
# order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# terms = vectorizer.get_feature_names()
# for i in range(number_of_clusters):
#     top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
#     print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))


Top terms per cluster:
Cluster 0: wa hi harri hermion t
Cluster 1: wa hi t s lili

In [83]:
vectorizer = TfidfVectorizer(stop_words='english', max_features = 10000)
X = vectorizer.fit_transform(hp_df['content'])

from sklearn.cluster import KMeans

number_of_clusters = 2
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]  # top 5 terms for this cluster
    print("Cluster {}: {}".format(i, ' '.join(top_words)))


Top terms per cluster:
Cluster 0: lily james sirius remus said
Cluster 1: harry hermione draco said just
